ABSTRACT

The initial location of data in DRAM is determined by the
address mapping, and even modern memory controllers use a
fixed, run-time-agnostic address mapping. The memory access
pattern seen at the memory interface, on the other hand,
changes dynamically at run-time. Because a fixed
address-mapping scheme cannot adapt to this changing access
pattern, DRAM performance cannot be exploited efficiently.

DReAM is a novel hardware technique that detects a
workload-specific address mapping at run-time, based on the
application access pattern, and thereby improves DRAM
performance. The experimental results show that DReAM
outperforms the best evaluated address mapping by 9% on
average for mapping-sensitive workloads, by 2% on average
for mapping-insensitive workloads, and by up to 28% across
all the workloads. DReAM can be seen as an insurance policy
capable of detecting which scenarios are not well served by
the predefined address mapping.

1. INTRODUCTION 

The growing number of general-purpose cores and accelerator
cores (e.g. GPU cores) integrated into a single chip and
competing for access to DRAM demands better performance from
the main memory. In this situation, exploiting the maximum
performance obtainable from the memory system is crucial.
However, due to the internal structure and organization of
DRAMs, described in Section 2, some memory bandwidth
(performance) is always wasted on internal conflicts. One of
the most serious conflicts in a DRAM memory system is the
‘page conflict’. It happens when two consecutive memory
requests go to different rows within the same bank. These
requests must then be serviced one after another, which
causes a high access latency for the second request. Dealing
with page conflicts is even more challenging because they
depend entirely on the memory access pattern: the rate of
page conflicts and the time of their occurrence change
dynamically according to the application behavior. To
mitigate the vulnerability of DRAM performance to page
conflicts, state-of-the-art memory controllers have evolved
into complex hardware components employing subsystems such
as schedulers. These schedulers take advantage of workload
run-time information (the sequence of memory requests) to
minimize page conflicts by reordering the memory commands
that are available to issue to the DRAM. However, the main
limitation for schedulers is the number of options (memory
requests) that they have to choose from at the time of
scheduling. In general, this number is limited by data
dependencies between memory requests, the number of running
threads, the number of cores, etc. Therefore, there are
conflicts that schedulers cannot eliminate. These page
conflicts result from the address mapping and data placement
in DRAMs. As discussed in the next section, address mapping
is the process that maps the physical address bits provided
by processors to the internal structure of DRAMs. This
process controls the initial data placement in memory. Thus,
it is important to understand how to select a good
address-mapping scheme to place and distribute data in DRAM
devices so as to mitigate page conflicts. This is possible
using a software-only approach, e.g. with OS support and
intelligent memory allocators. However, this option faces
complex problems when multiple independent applications
execute concurrently or in virtualized scenarios (both
hypervisors and containers), and it relies on software being
compiled for specific memory hardware.

This paper presents DReAM, a novel hardware technique that
approximates the entropy of each memory address bit over a
set of memory requests in order to generate
workload-specific address mappings at run-time. To rearrange
the address mapping at run-time, DReAM needs to support the
online data migration imposed by changing the address
mapping. This paper investigates different scenarios for
data migration with different levels of complexity. The
proposed solutions were evaluated over a wide range of
mapping-sensitive and mapping-insensitive workload mixes.
Three different address-mapping schemes were evaluated over
all the workloads and the best one was chosen to compare
against DReAM. Overall, DReAM is the first on-the-fly
mechanism capable of generating workload-specific address
mappings without requiring the running applications to be
stopped.

2. BACKGROUND 

Figure 1 presents the basic organization of a DRAM device.
Each DRAM device consists of multiple banks, each of which
has a data array and one row buffer. In practice, the data
array within a bank consists of multiple subarrays, each of
which has its own local row buffer. The local row buffers
within a bank are connected to other local row buffers as
well as the global row buffer. Chang et al., Kim et al. and
Seshadri et al. exploit these subarrays to improve DRAM
performance and bulk data copy in DRAMs.
Figure 1: DRAM device organisation.


The address mapping mechanism for DRAMs transforms the flat
1D physical address space into the internal 2D structure of
DRAM devices (row & column). Figure 2 illustrates how one
physical address can be interpreted with two different
mapping schemes. Most memory systems contain DIMMs, and a
DIMM can have multiple ranks of DRAMs. Multiple DIMMs can be
placed on a channel, i.e. the physical connection between a
memory controller and DRAMs. The reason for these many
hierarchical levels is to maximize the parallelism that can
be exploited when servicing multiple memory requests.
Figure 2: Two different address mapping schemes.

In general, an address-mapping scheme extracts the
corresponding address for Channel, Rank, Bank, Row and
Column from the physical address. Due to the internal
structure and electrical characteristics of DRAMs,
consecutive accesses to different memory locations can have
different costs depending on the previous state of the
memory. For instance, if there are two consecutive accesses
to the same row in the same bank of a DRAM, the second
access can have significantly smaller latency than the
first, since the target row has already been ‘opened’ by the
first memory request. On the other hand, if there are two
consecutive accesses to different rows within the same bank,
the second access has significantly higher latency than the
first, because the previous row must be ‘closed’ before the
new row is ‘activated’. This second scenario is a page
conflict, and it degrades the overall performance of DRAMs.
Page conflicts are sensitive to the data placement in DRAMs,
and data placement is determined by the address-mapping
scheme in the first place. Therefore, choosing an
address-mapping scheme carefully can reduce page conflicts
and improve the performance of DRAMs.
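The two cases above can be illustrated with a small row-buffer model. This is only a sketch under stated assumptions: it assumes an open-page policy (a row stays open until another row of the same bank is requested), and the function name and the example trace are hypothetical rather than taken from the evaluation.

```python
from typing import Dict, List, Tuple


def classify_accesses(accesses: List[Tuple[int, int]]) -> List[str]:
    """Label each (bank, row) access as 'miss', 'hit' or 'conflict'.

    Assumes an open-page policy: a row stays open in its bank's row
    buffer until a different row of the same bank is requested.
    """
    open_row: Dict[int, int] = {}  # bank -> currently open row
    labels: List[str] = []
    for bank, row in accesses:
        if bank not in open_row:
            labels.append("miss")      # bank idle: only an activate is needed
        elif open_row[bank] == row:
            labels.append("hit")       # row already open: cheapest case
        else:
            labels.append("conflict")  # close the old row, then activate
        open_row[bank] = row
    return labels


# Hypothetical trace of (bank, row) pairs.
trace = [(0, 5), (0, 5), (0, 9), (1, 9), (0, 9)]
print(classify_accesses(trace))
```

In this trace only the third access is a page conflict: it targets a different row of a bank whose row buffer already holds another row.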

2.1 Motivation - Address Mapping Analysis 

Figure 3 presents three well-known address-mapping schemes
currently employed by modern DRAM controllers. The first
mapping (Figure 3a) is a standard mapping intended to
exploit spatial locality by placing the column address in
the lowest bits. The next two address interleaving policies
are schemes proposed by Kaseridis et al. and Zhang et al.
The mapping proposed by Zhang et al. XORs some of the row
address bits with the bank address bits to produce a new
bank index (Figure 3b). This tries to change the bank ID
whenever the row ID changes, to reduce page conflicts.
Kaseridis et al. extend this technique by producing the
column index from a different section of the physical
address (Figure 3c). Both techniques aim to reduce page
conflicts in DRAMs.
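As a rough illustration of the XOR idea, the sketch below compares a plain row/bank/column split with a permuted bank index. The field widths (13 low bits for column plus block offset, 3 bank bits, 16 row bits) and the choice of XORing the lowest row bits are assumptions made for this example; the exact bit positions used by the published schemes may differ.

```python
# Hypothetical field widths for a 32-bit physical address.
COL_BITS, BANK_BITS, ROW_BITS = 13, 3, 16


def baseline_map(addr: int):
    """Split an address into (row, bank, column), column in the low bits."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return row, bank, col


def permuted_map(addr: int):
    """Same split, but the bank index is XORed with the lowest row bits."""
    row, bank, col = baseline_map(addr)
    bank ^= row & ((1 << BANK_BITS) - 1)
    return row, bank, col


# Two addresses that differ only in the lowest row bit.
a, b = 0, 1 << (COL_BITS + BANK_BITS)
print(baseline_map(a), baseline_map(b))   # same bank, different rows
print(permuted_map(a), permuted_map(b))   # different banks
```

Under the baseline split the two addresses fall into the same bank with different rows, which is the page-conflict pattern; with the XOR the second address lands in a different bank.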

There might be other variations of address-mapping schemes
than those presented in this figure that can be used to
perform the translation required to service a memory
request. However, the important point is that current memory
controllers can only use one such address-mapping scheme to
translate the physical address to the internal structure of
DRAMs. Moreover, modern DRAM controllers are limited to
performing read/write operations in bursts (typically bursts
of 4 or 8 items). This implies that some bits are used as a
block offset, as presented in Figure 3. To motivate the
technique presented in this paper, Figure 4 compares the
performance of the different address-mapping schemes for all
the benchmark suites evaluated in this work. Each bar in
these graphs represents the execution time normalized to the
baseline address-mapping scheme (address mapping 1 in
Figure 3a). Our experimental results (considering the
results of individual workloads) suggest that no predefined
address-mapping scheme is efficient in all situations, and
thus a fixed address-mapping scheme cannot deliver the best
execution time across all workloads. As Figure 4 suggests,
the permutation-based address mapping almost always (except
for the BIOBENCH benchmark) delivers a better geometric mean
(GMEAN) execution time than the other two address-mapping
schemes. This address mapping is chosen as the best baseline
of those presented in this paper to be compared against the
DReAM mapping.

(b) Mapping 2: Permutation-based Page Interleaving
(c) Mapping 3: Minimalist Open-Page Scheme
Figure 3: Different address mapping schemes.

Figure 4: Performance comparison among different address-mapping schemes.


3. DREAM: DYNAMIC REARRANGEMENT
OF ADDRESS MAPPING

DReAM is a novel technique that analyzes the memory access
pattern (produced by either single- or multi-threaded
applications) at run-time and estimates an efficient
address-mapping scheme that reduces page conflicts and
improves page hits. DReAM consists of two main phases:
‘online prediction of address mapping’ and ‘on-the-fly data
migration’.

3.1 Online Prediction of Address Mapping

The first step is to discover whether the current workload,
a set of executing applications, is a good match for the
baseline address-mapping scheme. A baseline address-mapping
scheme decides which physical address bits are used to
address which part of a DRAM device (e.g. rank, bank, row,
etc.). A physical address is therefore divided into sets of
bits, each set pointing to a specific part of the internal
hierarchy of the DRAM system. Considering consecutive
requests to a DRAM module, the rate at which each physical
address bit changes relative to the previous access
correlates strongly with the rate at which the accessed DRAM
location changes. Accessing different rows within the same
bank causes page conflicts and imposes a power and
performance overhead. Ideally, therefore, the change rate of
the physical address bits that address the row should be
kept as low as possible to reduce row switches within a
bank. DReAM estimates how much each physical address bit
changes by observing memory requests over a period of time,
as a means of generating improved memory mappings.
Estimating the change per bit requires minimal extra
hardware: one counter per physical address bit per memory
controller. The bits that change the most have higher
entropy and the bits that change the least have lower
entropy. For a given period, these counters (or frequency
change estimators) keep track of the number of changes of
each bit of the physical address in comparison with the
previous memory request. The given period creates time
windows and can be based on a number of clock cycles or a
number of memory requests. Figure 5 shows an example of five
consecutive accesses to demonstrate the function of these
counters.

Figure 5: Bit-counters mechanism.

The counter values of the two highlighted bits show that
bit 15 and bit 26 have changed once and four times,
respectively, in the last five memory requests. These
counters generate a pattern (or signature) that is
representative of the current memory access behavior as
perceived by the memory controller. Figure 6 shows such a
signature extracted from these counters for all the
benchmarks evaluated in this paper. The X-axis in each plot
represents the counter ID for each physical address bit and
the Y-axis shows the overall bit-change rate over the
application execution time. There is an exponential growth
in the rightmost five bits of almost all the patterns. This
is due to spatial locality, which implies accesses to
sequential physical addresses. These patterns, together with
the address-mapping schemes presented in Figure 3, justify
why the column address bits are typically placed at the
bottom of the physical address. In this way, accesses to
consecutive cache lines are mapped to consecutive columns
within the same row (i.e. page hits).

Figure 6: Extracted bit-change pattern for all benchmark suites.

3.1.1 Address Mapping Prediction 

Given the signature for a set of running applications, the
next issue is how to generate an optimized address-mapping
scheme. The idea is to map the physical address bits with
low variation to rows (to reduce row switching, i.e. page
conflicts), the bits with medium variation to banks, and the
bits with the highest rate of change to columns, to increase
locality and decrease page conflicts. Moreover, it is
possible to limit DReAM to rearranging only a part of the
physical address bits to mitigate the cost associated with
changing the address mapping in DRAMs (data migration,
discussed later). For instance, in this paper, DReAM does
not rearrange the column-address bits, to avoid
cache-line-level migration. To produce a new address-mapping
scheme at run-time, (i) the bit-change rate of each physical
address bit is monitored for each time window, (ii) a new
address-mapping scheme is estimated from the monitoring
information of each time window, and (iii) the bit-change
rates under the predefined and the new address-mapping
schemes are compared for each time window. If the new
address-mapping scheme improves the bit-change rate over the
baseline address mapping by more than a desired (and
programmable) threshold for a number of consecutive time
windows defined by the ‘Consistency Threshold’, then the new
address mapping is used as the primary address-mapping
scheme in the system.
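A minimal sketch of the estimation step is given below. It assumes that column bits are excluded from rearrangement (as stated above), that the remaining bits are simply sorted by their counter values, and that the "bit-change rate" of a mapping is measured as the total number of changes on its row bits. That metric, the function names and the counter values are illustrative assumptions, not the exact hardware policy.

```python
from typing import Dict, List


def estimate_mapping(change_counts: Dict[int, int], bank_bits: int) -> Dict[str, List[int]]:
    """Assign the least-changing bits to the row and the rest to the bank.

    change_counts maps each rearrangeable physical address bit
    (column bits excluded) to its counter value for one time window.
    """
    ordered = sorted(change_counts, key=lambda b: (change_counts[b], b))
    n_row = len(ordered) - bank_bits
    return {"row": sorted(ordered[:n_row]), "bank": sorted(ordered[n_row:])}


def row_change_total(change_counts: Dict[int, int], row_bits: List[int]) -> int:
    """Assumed metric: total number of changes observed on the row bits."""
    return sum(change_counts[b] for b in row_bits)


def relative_improvement(change_counts, old_rows, new_rows) -> float:
    """Fraction by which the new row assignment lowers the row change total."""
    old = row_change_total(change_counts, old_rows)
    new = row_change_total(change_counts, new_rows)
    return 0.0 if old == 0 else (old - new) / old


# Hypothetical counters for bits 13..17; the baseline uses 15..17 as row bits.
counts = {13: 40, 14: 3, 15: 5, 16: 90, 17: 2}
mapping = estimate_mapping(counts, bank_bits=2)
print(mapping)
print(relative_improvement(counts, [15, 16, 17], mapping["row"]))
```

Here the frequently changing bit 16 moves from the row field to the bank field and the quiet bit 14 takes its place, so the row bits change far less often than under the baseline.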

3.1.2 Mathematical Insight 

Intuitively, DReAM proposes a simple technique to detect an
application-specific address-mapping scheme based on
monitoring physical address bit changes. The question,
however, is whether there is analytical evidence that the
application-specific address-mapping scheme predicted by
this method actually improves the performance of the memory
system.

As discussed, the predicted address-mapping scheme is used
only if it reduces the bit-change rate, in comparison with
the baseline address mapping, beyond a certain threshold.
This means that DReAM assumes a correlation between the
bit-change rate of physical address bits and the performance
of DRAMs. To investigate this, the correlation coefficient
between the average bit-change improvement reported by DReAM
and the performance improvement of the memory system when
using the DReAM address mapping was computed. The
experimental results show that there is a strong
correlation, 0.89
with a very small P-value (i.e. 1.97 x 10“^^), between 
the bit-change rate and the final performance improvement.
This justifies why the address-mapping scheme predicted by
DReAM can improve the performance of DRAMs. Figure 7 shows
the bit-change rate improvement reported by DReAM and the
final performance improvement achieved using the
address-mapping scheme predicted by DReAM.



Figure 7: Bit-change rate improvement vs. the overall 
performance improvement. 


3.1.3 Mapping-Sensitive Vs. Mapping-Insensitive 

Looking at the patterns presented in Figure 6 and
considering the basic principles behind the address mapping
prediction explained so far, it is possible to categorise
workloads based on their sensitivity to the address mapping.
If there is an opportunity to swap a physical address bit
with a high change rate that belongs to the row address
space with another physical address bit with a smaller
change rate that is not part of the row address space, the
workload is called mapping-sensitive. Otherwise, it is
categorised as a mapping-insensitive workload. For instance,
‘stream’ (Figure 6.25) is a mapping-insensitive workload
since all the bits dedicated to the row address space have a
smaller change rate than the other bits. On the other hand,
‘libquantum’ (Figure 6.39) is a mapping-sensitive workload
since there is an opportunity to swap bit 14 with another
bit with a smaller change rate (say, bit 10).

3.2 Data Migration Challenge 

Changing the address-mapping scheme of a DRAM on-the-fly
faces a very important obstacle: the requirement for data
migration. Initially, data is placed into DRAM based on a
predefined address-mapping scheme. Changing the
address-mapping scheme therefore implies that the data
previously loaded into the DRAM cannot be accessed using the
new scheme. Thus, before employing the new address mapping,
the existing data in DRAMs must be migrated to new locations
based on the new address-mapping scheme. This imposes some
overhead on the overall performance of the memory system. To
alleviate this overhead, this paper investigates two
different scenarios, explained as follows.

3.3 Data Migration Solutions 

3.3.1 Scenario 1 - Offline Data Migration 

This scenario describes the simplest DReAM implementation,
which imposes minimal hardware overhead on the overall
memory system. In general, this scenario is well suited to
application-specific computer architectures, e.g. database
systems, where a specific application runs on the system
over and over. For instance, in a database system, depending
on the type of database (e.g. financial, medical, etc.),
usually only a few specific queries with minor variations
are used to search for specific data. Moreover, in the
big-data research area, running a query over a database
might take a few days or weeks. This produces a specific
memory access pattern that is usually consistent over a long
period of time. In this implementation, the memory access
pattern of applications (single- or multi-threaded) is
monitored at run-time for a desired period (e.g. a few hours
or a few days). This period is called the Region Of Interest
(ROI). Ideally, the ROI should be long enough to represent
the application access behavior. For instance, if the ROI
for a medical database is chosen to be one day, then the ROI
can cover the memory access pattern of almost all the
queries that are usually run on the database during the day.
DReAM then estimates an optimized address-mapping scheme
based on the average bit-change rate extracted from the
dedicated per-bit counters over the entire ROI. This new
mapping is saved in the memory controller and, upon
rebooting the system, the user has the option to choose the
DReAM address-mapping scheme over the baseline from the
system BIOS. Thus, whenever the user reboots the system, the
memory controller can employ a new address mapping estimated
in the DReAM calibration mode. A similar approach has been
implemented for the Intel adaptive page policy and in a
special beta BIOS provided by ASUS that allows the user to
choose a desired page closure policy at system start-up.

In this scenario, there is a penalty for the rebooting
process, but after that, as long as the usual workloads are
running on the system, the overall performance of the memory
system is improved by the new address-mapping scheme. This
is why this scenario is well suited to systems with
consistent behavior over time.

3.3.2 Scenario 2 - Online Data Migration 

This scenario investigates the possibility of performing
on-the-fly data migration inside a DRAM device by proposing
small modifications to the internal structure of the memory
system.

Basic Procedure: Figure 8 presents the basic flowchart of
servicing a memory request while using DReAM under the
second data migration scenario. To minimize the overhead of
migration, a row is migrated only when it is accessed. In
practice, this means that the migration occurs gradually, on
demand.



Figure 8: DReAM flowchart. 


On the first access to a row, the requested physical address
is translated to the internal structure of the DRAM using
both the Predefined Address-Mapping Scheme (PAMS) and the
Estimated Address-Mapping Scheme (EAMS). The address
translated by PAMS is the source row address and the address
translated by EAMS is the destination row address. Two main
functions may be applied to the requested address in
different situations: Migration and Swap. Why they are
required and how they work is discussed later in this
section; they are introduced here only to explain the
flowchart.

The first step is to determine whether the accessed row is
in its original location, pointed to by PAMS. Two bits are
dedicated to each row in a DRAM bank to keep track of the
current status of that row: one bit (Migration-Bit)
indicates whether the row has been moved to its new location
(migrated) and one bit (Swap-Bit) indicates whether the row
has been swapped (this process is discussed below). Two
tables can be dedicated to holding these bits for the entire
DRAM module: the Migration Table (MT) and the Swap Table
(ST). At this point, several situations can arise:

• If the requested row is in its original location (the
migration-bit and swap-bit are 0) then (i) PAMS is used to
access and service the requested row, and (ii) the
requested row is migrated to the destination location
pointed to by EAMS. (iii) If the destination location is
occupied by a different row then, intuitively, the content
of the destination row also needs to be migrated to a
third place. This can produce a chain of unnecessary data
migrations, which is costly. To avoid this, a simple
row-swap algorithm is employed: in such situations the
content of the destination row is swapped with the content
of the source row (and the corresponding swap-bit changes
to 1).

• If the requested row has been migrated then EAMS is used
to access and service the requested row.

• If the requested row has been swapped then (i) the swapped
location is calculated by applying the reverse
address-mapping mechanism to the source location, (ii)
step (i) is repeated until the swap-bit of the location
pointed to by the reverse address-mapping scheme is 0,
(iii) the request is serviced, (iv) the requested row is
migrated to the destination location pointed to by EAMS,
and (v) a swap is performed if necessary.

To make all this happen inside the DRAM, some modifications
to the traditional structure of DRAMs are needed, as
explained below.

Required DRAM Modification: There are two main requirements
for DReAM to perform data migration in a DRAM device: the
capability of bulk data copy inside DRAM and the capability
of on-the-fly buffering of an entire row to perform the swap
operation. Both requirements have been studied individually
by previous work to address different issues, using the
existing subarray-level parallelism in DRAMs, and are
described in the following.

Bulk Data Copy in DRAM: Seshadri et al. exploit the existing
subarrays per bank in DRAMs to copy an entire row from one
location to another inside DRAMs. Depending on the location
of the source and destination rows, there are three
different scenarios to consider: (i) copying between two
rows within the same subarray (intra-subarray), (ii) copying
between two rows in different subarrays in the same bank
(inter-subarray), and (iii) copying between two rows in
different banks (inter-bank).

Subarray-Level Parallelism: Kim et al. proposed some small
modifications to DRAMs to exploit the existing
subarray-level parallelism. They discussed three different
levels of modification to DRAM that improve access latency
by making subarrays work independently. The part of this
work most relevant to this paper is called MASA. The key
idea of MASA is to allow multiple activated subarrays in the
same bank. MASA requires (i) a designated-bit latch for each
subarray, (ii) a new DRAM command, subarray-select (SA-SEL),
and (iii) the routing of a new global wire. Based on their
experimental methodology, they showed that the required
extra latches impose a 0.15% area overhead and consume
72.2 µW of additional power for each ACTIVATE command.
Moreover, they estimated an extra 0.56 mW of static power in
the steady state imposed by multiple activation of
subarrays.

With these techniques in place, the following sections give
an overview of the DReAM architecture.

3.4 DReAM - Overview of Architecture 

Figure 9 presents a high-level overview of the DReAM
architecture. DReAM includes two main phases,
Address-Mapping Estimation and Online Data Migration.

3.4.1 Address-Mapping Estimation 



Figure 9: DReAM architecture.


Address-Mapping Estimation requires minimal architectural
support. Only one counter per physical address bit, a
history register to hold the last accessed address, and an
array of XORs to detect the bit changes between two
consecutive memory requests are required to extract the
access pattern at run-time. Figure 10 presents a simple
overview of such a structure. In this structure, each bit of
the currently accessed address is XORed with the
corresponding bit of the last accessed address. If the bits
differ, the corresponding counter is incremented. As
discussed, this produces a pattern of physical address bit
changes over a period, which can be used to estimate an
application-specific address-mapping scheme.
Figure 10: DReAM monitoring counter structure.
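The monitoring structure (history register, XOR array, one counter per bit) can be modelled in software as a sketch. The class name, the 32-bit address width and the example addresses are assumptions for illustration; the model also assumes that the very first request is only latched into the history register and not compared against anything.

```python
from typing import List, Optional


class BitChangeMonitor:
    """Software model of the per-bit change counters."""

    def __init__(self, addr_bits: int = 32) -> None:
        self.addr_bits = addr_bits
        self.counters: List[int] = [0] * addr_bits
        self.last_addr: Optional[int] = None  # history register

    def observe(self, addr: int) -> None:
        """XOR the new address with the previous one and count set bits."""
        if self.last_addr is not None:
            diff = addr ^ self.last_addr
            for bit in range(self.addr_bits):
                if (diff >> bit) & 1:
                    self.counters[bit] += 1
        self.last_addr = addr

    def reset_window(self) -> None:
        """Clear the counters at the start of a new time window."""
        self.counters = [0] * self.addr_bits


# Five hypothetical requests: bit 3 toggles every time, bit 1 changes once.
monitor = BitChangeMonitor()
for address in [0b0000, 0b1000, 0b0000, 0b1010, 0b0010]:
    monitor.observe(address)
print(monitor.counters[3], monitor.counters[1])
```

With five requests there are four comparisons, so a bit that toggles on every request reaches a count of four, which matches the scale of the five-access example used earlier in the text.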


3.4.2 Data Migration - Operation 

The data migration required by DReAM can be described
considering the following observations:

First, all the local row buffers (one local row buffer in
each subarray) within a bank are connected to the global row
buffer using global bitlines, and all the row buffers
(either local or global) within a DRAM device are connected
together using a narrow I/O bus (64 bits wide). Second, with
the modification proposed by Kim et al., the DRAM module
supports MASA. This allows multiple subarrays to be
activated, while only one of them can be connected to the
global bitlines at a time. Figure 11 presents the possible
scenarios in which data migration might happen. In
describing the following scenarios, it is assumed that the
destination row is always occupied by another row
(worst-case scenario) and thus a swap is necessary.



Figure 11: Different data migration scenarios. (a) Intra-Subarray, (b) Inter-Subarray, (c) Inter-Bank.

Intra- and Inter-Subarray Migration: although Figure 11
presents all the possible scenarios for data migration, the
first two scenarios are inefficient given the main purpose
of DReAM (i.e. reducing page conflicts). The reason is that
in the first two scenarios the data migration happens within
the same bank, which does not reduce the probability of page
conflicts. Thus, there is no point in paying an extra
penalty to perform the data migration for these two
scenarios.


Inter-bank Migration: in this scenario (Figure 11c), the
source and destination rows are in different banks.
Therefore, both rows can be activated in parallel. To
perform this migration, the memory controller (i) activates
both the source and destination rows and loads their
contents into their local row buffers, (ii) puts bank A into
read mode and bank B into write mode, (iii) transfers the
source row from local row buffer 1 in bank A to the global
row buffer of bank B using the narrow I/O bus, (iv) puts
bank A into write mode and bank B into read mode, (v)
transfers the destination row from local row buffer 1 in
bank B to the global row buffer of bank A using the narrow
I/O bus, and (vi) connects the global bitlines of the global
row buffer in bank A to the source row and those of the
global row buffer in bank B to the destination row.

3.4.3 Data Migration - Timing Overhead 

The latency overhead imposed by data migration for each
workload is the number of inter-bank migrations (Figure 11c)
times the cost of transferring a row over the internal
narrow (i.e. 64-bit) I/O bus. With a transfer rate of
64 bits/clock and a row buffer size of 4 Kbit (per device),
64 clock cycles are required to transfer a row from one bank
to another. Another 64 clock cycles are required if a swap
is necessary. Therefore, in the worst case, the penalty for
each data relocation between two banks is 128 memory clock
cycles. Assuming that the CPU clock is 4 times faster than
the memory clock, the data migration penalty is 512 CPU
clock cycles. Very pessimistically, it is assumed that the
processor is stalled while the data migration is happening.
Therefore, 512 clock cycles times the number of required
inter-bank data migrations gives a good estimate of the
extra overhead imposed on the overall execution time.
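The arithmetic above can be reproduced with a short helper. The parameter defaults are the figures quoted in the text (4 Kbit row buffer, 64 bits per clock, CPU clock 4 times the memory clock); the function names and the migration count in the usage line are hypothetical.

```python
def migration_penalty_cpu_cycles(row_bits: int = 4096,
                                 bus_bits_per_clock: int = 64,
                                 cpu_to_mem_ratio: int = 4,
                                 swap: bool = True) -> int:
    """Worst-case CPU cycles for relocating one row between two banks."""
    transfer_cycles = row_bits // bus_bits_per_clock  # 4096 / 64 = 64
    # A swap moves two rows, one in each direction.
    mem_cycles = transfer_cycles * (2 if swap else 1)
    return mem_cycles * cpu_to_mem_ratio


def total_overhead_cpu_cycles(num_migrations: int) -> int:
    """Pessimistic estimate: the processor stalls for every migration."""
    return num_migrations * migration_penalty_cpu_cycles()


print(migration_penalty_cpu_cycles())    # 512
print(total_overhead_cpu_cycles(1000))   # 512000
```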


3.4.4 Rollback Process to Avoid Degradation Loop 

DReAM predicts an application-specific address-mapping
scheme based on the application access pattern observed
during a past monitoring period. However, there is no
guarantee that the access pattern will not change again in
the future. The address-mapping scheme predicted by DReAM
might therefore stop being efficient, and continuing to
use it might degrade the performance of the DRAMs (i.e. a
Degradation Loop). To work around this issue, DReAM
supports a ‘Rollback’ procedure. As discussed, DReAM
switches to the predicted address-mapping scheme if the
new mapping improves the bit-change rate over the baseline
by more than a predefined threshold for consecutive time
windows. A similar approach is used to evaluate the
efficiency of the predicted address mapping at run-time.
DReAM keeps monitoring the bit-change pattern over the
time windows even after a new address-mapping scheme has
been adopted. If the bit-change rate of the predicted
address-mapping scheme no longer outperforms the baseline,
DReAM switches back to the predefined address-mapping
scheme. This triggers the rollback function, which returns
the migrated rows to their original locations. In this
situation the memory controller can switch between at
least two address-mapping schemes based on the application
access pattern. A third address-mapping scheme can be
employed once the rollback process completes, which means
that all the rows migrated under the previous address
mapping have returned to their original locations.
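The switching and rollback behavior described above can be
pictured as a small state machine. The Python sketch below
is a simplified illustration under stated assumptions, not
the DReAM hardware: the class and method names are
hypothetical, the requirement of two consecutive windows
is an assumed value, and rollback is assumed to start as
soon as the measured improvement drops to zero or below.
Only the 7% threshold is taken from the experiments
reported in this paper.

```python
class MappingSelector:
    """Toy model of the switch / rollback decision, one call per time window."""

    def __init__(self, threshold=0.07, windows_needed=2):
        self.threshold = threshold            # 7% threshold from the paper
        self.windows_needed = windows_needed  # assumed value
        self.state = "baseline"
        self.streak = 0
        self.migrated_rows = 0

    def end_of_window(self, improvement, rows_migrated=0):
        """improvement: fractional bit-change gain of the candidate mapping."""
        if self.state == "baseline":
            # Count consecutive windows that beat the threshold.
            self.streak = self.streak + 1 if improvement > self.threshold else 0
            if self.streak >= self.windows_needed:
                self.state, self.streak = "predicted", 0
        elif self.state == "predicted":
            self.migrated_rows += rows_migrated
            if improvement <= 0.0:            # assumed rollback criterion
                self.state = "rolling_back"
        # While rolling back, no new mapping may be adopted.
        return self.state

    def rollback_step(self, rows_returned):
        """Return migrated rows; the baseline is restored when none remain."""
        if self.state == "rolling_back":
            self.migrated_rows = max(0, self.migrated_rows - rows_returned)
            if self.migrated_rows == 0:
                self.state = "baseline"
        return self.state
```

In this sketch a further mapping can only be considered
after `rollback_step` has brought `migrated_rows` back to
zero, which mirrors the condition stated in the text.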

4. EVALUATION METHODOLOGY 

Simulator: USIMM was used as the main simulation platform
for these experiments. USIMM was modified to support
Permutation-based Page Interleaving and the Minimalist
Open-Page scheme, plus a full implementation of the DReAM
architecture. DReAM was evaluated on a 4 GB DRAM organized
in 1 channel, running single thread applications. To
increase the randomness of memory access patterns, the
size of memory was kept fixed while running multithread
applications. An FR-FCFS scheduling algorithm is used in
our experiments. Table 1 presents the configuration
parameters of USIMM.


Model            Description           Value
Processor        Clock Speed           3.2 GHz
                 ROB size              32
Memory System    Bus Speed             800 MHz
                 Number of Channels    1
                 Ranks per channel     1
                 Banks per rank        8
                 Rows per bank         65,536
                 Cache line size       64 Bytes

Table 1: USIMM configuration parameters. 
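As a consistency check on Table 1, the number of address
bits needed by each field can be derived from the listed
parameters and the 4 GB capacity. The Python sketch below
is illustrative: the function name and the field labels
are hypothetical, the column field is counted at
cache-line granularity, and nothing is implied about the
order in which the fields appear in a physical address.

```python
def address_field_widths(capacity_bytes=4 * 2**30, channels=1, ranks=1,
                         banks=8, rows=65_536, line_bytes=64):
    """Bits needed per address field for the Table 1 configuration."""
    # Bytes held by one row of one bank, derived from the total capacity.
    row_bytes = capacity_bytes // (channels * ranks * banks * rows)
    lines_per_row = row_bytes // line_bytes

    def bits(n):
        # Exact for the power-of-two sizes used here.
        return n.bit_length() - 1

    return {
        "channel": bits(channels),
        "rank": bits(ranks),
        "bank": bits(banks),
        "row": bits(rows),
        "column": bits(lines_per_row),  # cache-line granularity
        "offset": bits(line_bytes),
    }
```

For the default values the widths add up to 32 bits, which
matches a 4 GB physical address space.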

Address Mapping Schemes: The memory access pattern, and
as a result the number of page conflicts in DRAMs, can be
affected by the predefined memory address-mapping scheme.
The experiments consider the three different address
mappings presented in Figure 3. The experimental results
presented in Section 2.1 and its accompanying figure show
that the Permutation-based Page Interleaving policy
(Mapping 2) performs best for most of the workloads.
Therefore, this address-mapping scheme is employed as a
fair baseline for comparison with the DReAM scheme.

SPEC                                  PARSEC               COMMERCIAL
(a) GemsFDTD r     (k) astar B        (u) canneal          (D1) comm1
(b) bzip2 l        (l) bzip2 t        (v) streamcluster    (D2) comm2
(c) cactusADM b    (m) gcc 1          (w) blackscholes     (D3) comm3
(d) gcc 2          (n) gcc c          (x) facesim          (D4) comm4
(e) gcc cp         (o) gcc g          (y) ferret           (D5) comm5
(f) gcc sc         (p) mcf r          (z) fluidanimate     BIOBENCH
(g) milc s         (q) omnetpp o      (A) freqmine         (E) mummer
(h) soplex r       (r) sphinx3 a      (B) swaptions
(i) xalancbmk r    (s) zeusmp z       HPC
(j) libquantum     (t) leslie         (C) hpc1 - hpc13

Table 2: Evaluated workloads and benchmark suites.

Workloads: The workloads include a wide range of
memory-intensive applications (i.e. 48 workloads) from
different benchmark suites (PARSEC, SPEC, BIOBENCH, HPC
and COMMERCIAL) and representative regions of interest for
each application. Table 2 lists the workloads and their
corresponding benchmark suites. An identifier is assigned
to each application to facilitate the naming of
multithread workloads constructed from these applications.
To increase the variety of memory access patterns, USIMM
was set up for multithread applications to evaluate 20
randomly selected workload mixes: a combination of
4-thread and 8-thread applications. Table 3 lists these
multithread workloads, using the identifiers of the single
thread workloads presented in Table 2.


Multithread Workloads
M1:  C13-C5-x-t       M11: C9-C13-C5-w-x-t-j-q
M2:  C9-w-j-q         M12: C8-C3-w-x-y-a-t-j
M3:  w-x-y-t          M13: C8-C5-x-y-a-t-p-q
M4:  C8-C5-t-p        M14: C9-C12-C13-C9-C12-C12-p-q
M5:  t-t-p-g          M15: C13-x-t-g-p-t-p-g
M6:  C8-w-p-q         M16: C8-C3-C5-w-C5-C5-p-q
M7:  C3-C5-C5-C5      M17: C9-w-y-w-w-a-t-g
M8:  C9-w-y-w         M18: C13-C3-x-C13-a-a-p-g
M9:  C12-C13-a-a      M19: C12-C13-y-a-a-a-g-q
M10: x-t-j-q          M20: x-y-p-a-x-a-p-q


Table 3: Randomly selected multithread workloads. 

5. RESULTS AND DISCUSSIONS 
5.1 Performance Analysis 

In this section the performance of DReAM is investigated.
Before turning to the result graphs, the following summary
might be helpful: (i) The performance numbers presented in
this section are normalized to the baseline
(Permutation-based address mapping), which delivers the
best average execution time among the three
address-mapping schemes presented in Figure 3. (ii) As
discussed, the offline mapping is desirable only for
applications with consistent behavior and is obtained
after a long calibration period. The rebooting cost is
therefore negligible for such long-running applications.
Thus, in the results presented in Figures 12 to 16, the
cost of rebooting is ignored in the case of DReAM-Offline
and only the efficiency of the address mapping detected by
DReAM is investigated, in comparison with the baseline
mapping.

Figure 12 presents the execution time for BIOBENCH and
PARSEC benchmarks. This result suggests that the baseline
address-mapping scheme is good enough for the workloads in
these benchmarks and that DReAM has no margin to predict a
better address-mapping scheme. Consequently, DReAM yields
no bit-change rate improvement over the baseline. The
small degradation by DReAM-Offline (i.e. around 1%)
visible in Figure 12 is due to slightly different access
patterns caused by reordering the baseline address bits,
and can be counted as noise. DReAM-Online, on the other
hand, mitigates this issue by checking the bit-change
improvement between two consecutive time windows on the
fly against a predefined threshold. For instance, in these
experiments DReAM-Online employs the new address mapping
only if it improves the bit-change rate by more than 7%.
Thus, although DReAM cannot predict a better
address-mapping scheme than the baseline, it does not
degrade the performance in most cases. A similar behavior
can be observed in Figures 13 to 16.

Overall, DReAM-Offline outperforms the permutation-based
address-mapping scheme (the best evaluated baseline) by 5%
on average, and by up to 28% across all the workloads. In
the case of DReAM-Online, 12 workloads satisfy DReAM’s
threshold at run-time (i.e. improve the bit-change rate by
more than 7%), and for these workloads DReAM-Online
outperforms the baseline by 4.5% on average, and by up to
23%. Figure 16 depicts the execution time for the randomly
selected multithread workloads presented in Table 3. These
results show that DReAM can still predict a better
address-mapping scheme than the baseline even for
multithread workloads, which produce a highly random
memory access pattern. Looking at the results from a
different angle, DReAM outperforms the best evaluated
baseline address mapping on average by 9% and 2% for
mapping-sensitive and mapping-insensitive workloads
respectively.

Figure 12: Results for BIOBENCH and COMMERCIAL benchmark suites.

Figure 13: Results for HPC benchmarks.

Figure 14: Results for the PARSEC benchmark suite (workloads
shown: black, canneal, face, ferret, fluid, freq, stream, swapt).

Figure 15: Results for the SPEC benchmark suite.

Figure 16: Results for multithreaded benchmarks.

Legend for Figures 12 to 16: Baseline, DReAM-Offline, DReAM-Online.

Considering the results presented in Figure 15, libquantum
achieves a significant performance improvement by taking
advantage of DReAM. To understand this outcome, it is
useful to check the extracted pattern for this workload
presented in an earlier figure. That figure shows a high
change rate for bit 14. This bit is mapped to the row
address space, which increases the probability of
accessing different rows within the same bank (i.e. a Page
Conflict) and so imposes a significant performance
overhead. DReAM simply assigns this bit to another address
space (e.g. the bank or column address space) by
exchanging it with a bit that has a minimal change rate.
In this situation, the high change rate of this bit
increases the probability of interleaving accesses across
different banks, which improves the level of parallelism
(or access locality, if the column address space is used)
in the system and, as a result, improves performance
significantly. A similar argument explains the significant
performance improvement for other workloads, such as
‘face’.
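The bit exchange described for libquantum can be sketched
in a few lines of Python. This is a minimal illustration
under stated assumptions rather than the DReAM selection
logic: the function names are hypothetical, the policy of
pairing the most active row bit with the least active bank
bit is a simplification, and the bit positions used in the
usage note are invented for the example.

```python
def choose_swap(change_counts, row_bits, bank_bits):
    """Pick (hot_row_bit, cold_bank_bit), or None if swapping cannot help."""
    hot = max(row_bits, key=lambda b: change_counts[b])
    cold = min(bank_bits, key=lambda b: change_counts[b])
    if change_counts[hot] <= change_counts[cold]:
        return None
    return hot, cold


def swap_address_bits(addr, i, j):
    """Return addr with the values at bit positions i and j exchanged."""
    if ((addr >> i) ^ (addr >> j)) & 1:   # the two bits differ
        addr ^= (1 << i) | (1 << j)       # flipping both exchanges them
    return addr
```

For example, with hypothetical change counts
`{13: 5, 14: 900, 15: 20, 16: 10}`, row bits `[14, 15, 16]`
and bank bit `[13]`, `choose_swap` selects the pair
`(14, 13)`. Applying `swap_address_bits` twice with the
same pair restores the original address, which is the
property a rollback relies on.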


5.2 Data Relocation Analysis 

As discussed, the data relocation required by DReAM is
composed of two main phases: Migration and Rollback. The
following presents a statistical analysis of the
migrations and rollbacks required by DReAM.

Migration vs. Rollback: The experimental results
(presented in Figure 12 to Figure 15) show that 12
standard workloads undergo dynamic data relocation. Of
these 12 workloads, only two require data rollback:
‘ferret’, with 10% of data relocation spent on data
rollback, and ‘libquantum’, with 39% of data relocation
spent on data rollback.

Inter Bank vs. Intra Bank Data Relocation: The results
show that 87.5% of data relocation happens between banks
(Inter-bank relocation) and 12.5% happens within banks
(Intra-bank relocation). As mentioned, DReAM does not
perform the Intra-bank relocations, in order to reduce the
cost of data relocation.


5.3 Storage Overhead and Scalability 

Address Mapping Prediction: As discussed, extracting the
monitoring pattern requires only one counter and one XOR
gate per physical address bit, plus one history buffer to
keep track of the last accessed address. Thus, assuming a
sampling window of 250K memory requests, 18-bit counters
times the number of physical address bits are the main
storage overhead of the first phase of DReAM. Our
experimental results show that this number is no more than
60 bytes for 512 GB DRAM.
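The monitoring structure described above (one XOR and one
counter per address bit, plus a one-entry history buffer)
can be modelled in software as follows. This Python sketch
is an illustration, not the hardware design: the class and
attribute names are hypothetical, and the counter width is
simply the number of bits needed to count up to the window
size, which gives 18 bits for a 250K-request window.

```python
class BitChangeMonitor:
    """Software model of per-bit change counters over one sampling window."""

    def __init__(self, num_address_bits, window_requests=250_000):
        self.num_bits = num_address_bits
        # Bits needed to count every request in the window.
        self.counter_width = window_requests.bit_length()
        self.counters = [0] * num_address_bits
        self.last_addr = None   # history buffer: last accessed address
        self.seen = 0

    def observe(self, addr):
        if self.last_addr is not None:
            diff = addr ^ self.last_addr          # one XOR per address bit
            for b in range(self.num_bits):
                self.counters[b] += (diff >> b) & 1
        self.last_addr = addr
        self.seen += 1

    def change_rates(self):
        """Fraction of consecutive request pairs in which each bit flipped."""
        transitions = max(self.seen - 1, 1)
        return [c / transitions for c in self.counters]
```

A bit with a change rate close to 1 flips on almost every
request, whereas a rate close to 0 marks a bit that is
nearly constant within the window.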

Data Migration: The online data migration needs to keep
track of migrated and swapped pages. The required MT and
ST therefore impose extra storage overhead on the overall
memory system. Figure 17 depicts the overall storage
overhead imposed by online data migration. This result
shows that DReAM imposes a storage
overhead of 3x 10“^% to the overall DRAM size. De¬ 
pending on the implementation choice the MT and ST 
can be implemented as a part of the memory controller 
or as Metadata inside the DRAM. 


6. RELATED WORK 

The shortfalls of DRAMs with respect to page conflicts are
widely recognized in the area of memory system design.
Prior work has proposed a wide range of techniques, such
as memory interleaving schemes, scheduling algorithms and
architectural modifications to the current structure of
DRAMs, to mitigate this issue. For instance, Zhang et al.
proposed a page interleaving scheme to reduce page
conflicts and exploit data locality. Hsu et al. [14]
proposed another memory interleaving scheme to address the
same issue. There are many other interesting works on new
scheduling algorithms [20, 21] that prioritize servicing
certain memory requests to reduce page conflicts and
improve memory performance. Other work in this area
proposes either a new architecture for DRAMs or a small
modification to the traditional structure of these memory
systems. For instance, Sudan et al. proposed a technique
to recognise highly accessed data in DRAM and place it in
the same row to improve data locality. Kim et al. proposed
a technique to exploit the existing subarray-level
parallelism in DRAMs to reduce the impact of bank
conflicts. PARDIS by Bojnordi et al. is a programmable
memory controller that can be configured using a specific
instruction set architecture (ISA). Although the focus of
that work was not on developing an optimized
address-mapping scheme, the authors configured PARDIS with
an application-specific address-mapping heuristic obtained
by offline profiling analysis and reported a good
performance improvement in the memory system.

Figure 17: DReAM data migration overhead.


7. CONCLUSIONS 

This paper has introduced DReAM, a novel hardware
technique based on approximating the entropy of each
memory address bit for a set of memory requests. DReAM
makes three main contributions. First, a low-cost pattern
recognition technique is developed to extract the memory
access pattern at run-time. Second, a methodology is
proposed to estimate an optimized address mapping based on
the detected access pattern. Finally, a technique is
proposed for the on-the-fly migration of data within DRAMs
to reduce page conflicts. An extensive performance
evaluation was carried out with 48 different workloads
from 5 benchmark suites and 20 multithreaded applications.
In summary, DReAM-Offline outperforms the
permutation-based address-mapping scheme (the
state-of-the-art mapping) by 5% on average, and by up to
28% across all workloads. In the case of DReAM-Online, 12
workloads satisfy DReAM’s threshold at run-time (i.e.
improve the bit-change rate by more than 7%), and for
these workloads DReAM-Online outperforms the baseline by
4.5% on average, and by up to 23%. Categorising workloads
as mapping-sensitive or mapping-insensitive, DReAM
outperforms the best evaluated baseline address mapping on
average by 9% and 2% for the first and second category
respectively. Overall, DReAM is the first on-the-fly
mechanism capable of generating workload-specific address
mappings without requiring running applications to be
stopped.